1 BACKGROUND 

This letter is a response to the comments of Serang (2012) on 
Huang and He (2012) in Bioinformatics. Serang (2012) claimed that 
the parameters for the Fido algorithm should be specified using 
the grid search method in Serang et al. (2010) so that Fido attains 
the accuracy it deserves in the performance comparison. At first 
glance, this is an argument about parameter tuning. However, the 
real issue is how to conduct an unbiased performance evaluation when 
comparing different protein inference algorithms. In this letter, we 
explain why we did not use the grid search for parameter selection in 
Huang and He (2012) and show that this procedure may result in an 
over-estimated performance that is unfair to competing algorithms. 
In fact, this issue has also been pointed out by Li and Radivojac 
(2012). 
Fig. 1: The correct and incorrect procedures for assessing the 
performance of protein inference algorithms. In model selection, we 
cannot use any ground-truth information, which should be visible 
only in the model assessment stage. Otherwise, we may over-estimate 
the actual performance of inference algorithms. 



2 MODEL SELECTION AND ASSESSMENT IN 
PROTEIN INFERENCE 

Machine learning is a cornerstone of modern bioinformatics. 
Meanwhile, an unbiased performance evaluation is undoubtedly 
the cornerstone of machine learning research and applications 
(Cawley and Talbot, 2010), as it provides a clear picture of the 
strengths and weaknesses of existing approaches. 

In real-world applications of machine learning methods, there 
are two closely related but separate problems: model selection 
and model assessment (Hastie et al., 2009). In model selection, we 
estimate the performance of different models in order to choose 
the best one. In model assessment, or performance evaluation, we 
test the prediction error of the final model obtained from the model 
selection process. 

The protein inference problem is also an instance of a prediction 
task in machine learning, as shown in Fig. 1. In model selection, we 
use the peptide-protein bipartite graph as the input to find a "best" 
inference model that produces a vector Ŷ (Huang et al., 2012). Each 
element in Ŷ can be either the probability/score that a protein 
is present or the presence status of that protein (true or false). In 
model assessment, we compare the predicted vector Ŷ with the 
ground-truth vector Y to obtain the performance estimates. This is 
the correct procedure for evaluating and comparing protein inference 
algorithms. 
In contrast, one possible mistake in an incorrect procedure 
is illustrated at the top of Fig. 1: part or all of the ground-truth 
vector Y is used in the model selection process of the 
protein inference algorithms. The problem is that the inference 
algorithms have an unfair advantage since they "have already 
seen" the absence/presence information in Y that should only be 
available during model assessment. In other words, the ground-truth 
information has leaked into the model selection phase. As 
a result, the performance estimates of the inference algorithms will 
be over-optimistic. This phenomenon is essentially analogous to 
the selection bias observed in classification or regression when 
feature selection is performed over all samples prior to performance 
evaluation (Smialowski et al., 2010). 
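
The effect of such leakage can be illustrated with a small simulation. 
The sketch below is not taken from any of the cited works: the function 
names `auc` and `simulate`, the number of proteins and the number of 
candidate settings are all assumptions made for illustration. Every 
candidate "parameter setting" produces pure-noise scores, so none of 
them has any real predictive value, yet the setting picked by looking 
at the ground-truth labels reports an AUC above the chance level of 0.5. 

```python
import random


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def simulate(n_proteins=60, n_settings=50, seed=0):
    """Compare a label-aware choice of setting with a label-blind one.

    All sizes are hypothetical. Each setting assigns random scores,
    so its expected AUC is 0.5 regardless of the labels.
    """
    rng = random.Random(seed)
    # First half "present", second half "absent" (stand-in for the vector Y).
    labels = [i < n_proteins // 2 for i in range(n_proteins)]
    candidates = [[rng.random() for _ in labels] for _ in range(n_settings)]
    aucs = [auc(scores, labels) for scores in candidates]

    leaky = max(aucs)   # setting chosen using the ground-truth labels
    blind = aucs[0]     # setting fixed in advance, labels never consulted
    mean_auc = sum(aucs) / len(aucs)
    return leaky, blind, mean_auc


if __name__ == "__main__":
    leaky, blind, mean_auc = simulate()
    print(f"label-aware selection: {leaky:.3f}")
    print(f"label-blind selection: {blind:.3f}")
    print(f"mean over all settings: {mean_auc:.3f}")
```

The gap between the label-aware value and the mean over all settings is 
produced entirely by the selection step, since no setting carries any 
signal. 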

According to the description in Serang et al. (2010) and the 
source code of Fido, the grid search procedure chooses the set of 
parameters that jointly maximizes the ROC50 score (the average 
sensitivity when allowing between zero and 50 false positives) and 
minimizes the mean squared error (MSE) from an ideally calibrated 
probability. Clearly, it has used the ground-truth information (true 
and false positive labels¹) that should only be available in the model 
assessment stage. In particular, the grid search procedure selects 
parameters using the ROC50 score as a key factor, which is directly 
related to the final performance index in the model assessment stage. 
Therefore, it is highly likely that over-fitting occurs, i.e., the use 
of grid search will lead to a performance overestimation. 

¹ In the target-decoy database search and evaluation strategy, a protein is 
regarded as a true positive if it comes from the target database and as a false 
positive otherwise. Therefore, the set of target/decoy labels is used as the set 
of ground-truth labels in this context, although some target proteins may be 
false positives. 

Fig. 2: The effect of using the ground-truth information in the grid 
search procedure of Fido. The grid search procedure finds a set of 
parameters automatically with the ground-truth labels of candidate 
proteins as input. Note that such presence/absence information of 
proteins should not be visible to the inference algorithms if it is 
once again used in the performance evaluation stage for comparing 
different algorithms. To mimic the situation in which the ground-truth 
information is unavailable, we assign a zero weight to ROC50 in 
the grid search method and calculate the average area under curve 
(AUC) value as the performance index of "without ground-truth". 
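
For concreteness, the ROC50 score described above can be computed as in 
the following sketch. It follows the verbal definition given in this 
letter rather than the Fido source code. The function name `roc_n`, the 
handling of tied scores (broken by input order) and the convention used 
when fewer than n false positives exist are assumptions. The point to 
note is that the function cannot be evaluated without the target/decoy 
labels. 

```python
def roc_n(scores, labels, n=50):
    """Average sensitivity over the first n false positives.

    scores: one score per protein (higher means more likely present).
    labels: True for a target protein (counted as a true positive),
            False for a decoy protein (counted as a false positive).
    """
    # Rank proteins from highest to lowest score; the sort is stable,
    # so tied scores keep their input order (an assumption of this sketch).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    if total_pos == 0 or n <= 0:
        return 0.0

    tp = 0      # true positives seen so far
    fp = 0      # false positives seen so far
    area = 0    # sum of true-positive counts at each false positive
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp
            if fp == n:
                break

    # With fewer than n false positives in the list, the remaining
    # steps are credited with the true positives found so far.
    area += (n - fp) * tp
    return area / (n * total_pos)
```

A perfect ranking gives a value of 1, and a ranking that places n false 
positives ahead of every true positive gives 0. 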

To check whether the grid search method leads to an over-optimistic 
performance estimate, we conduct the following experiment. As 
this procedure can control how much weight is given to 
ROC50 and how much weight is given to MSE in model 
selection, we first assign a zero weight to ROC50 to roughly mimic 
the situation in which the ground-truth information is invisible, so 
that no over-estimation occurs. Then, we compare its performance with 
that given by the algorithm when the default non-zero weight is used 
for ROC50. As shown in Fig. 2, the performance of Fido 
decreases when the ground-truth information (in terms of ROC50) 
is not used in model selection. One may argue that we cannot fully 
attribute the performance gain in grid search to the incorrect use 
of ground-truth information, but at the very least, it is unfair to 
other competing algorithms in a performance comparison. 
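
The role of the ROC50 weight in such an experiment can be sketched as 
follows. The linear form of the objective, the `Candidate` fields, the 
function names and all numeric values are hypothetical and were chosen 
only to show how the selected parameters can change with the weight. 
They do not reproduce the Fido implementation or the results in Fig. 2. 

```python
from typing import NamedTuple, Sequence, Tuple


class Candidate(NamedTuple):
    """One grid point with its two selection criteria (hypothetical layout)."""
    params: Tuple[float, ...]  # parameter values tried at this grid point
    roc50: float               # ROC50 score obtained with these parameters
    mse: float                 # calibration MSE obtained with these parameters


def objective(candidate: Candidate, roc_weight: float) -> float:
    """Assumed linear trade-off: reward ROC50, penalise MSE."""
    return roc_weight * candidate.roc50 - (1.0 - roc_weight) * candidate.mse


def grid_search(grid: Sequence[Candidate], roc_weight: float) -> Candidate:
    """Return the grid point with the highest objective value."""
    return max(grid, key=lambda c: objective(c, roc_weight))


# Made-up grid: the first point ranks proteins better, the second is
# better calibrated.
GRID = [
    Candidate(params=(0.1, 0.01), roc50=0.90, mse=0.05),
    Candidate(params=(0.3, 0.05), roc50=0.70, mse=0.01),
]

if __name__ == "__main__":
    print(grid_search(GRID, roc_weight=0.0).params)  # ROC50 ignored
    print(grid_search(GRID, roc_weight=0.5).params)  # ROC50 and MSE both used
```

With a zero weight the ROC50 term drops out of the objective, so the 
choice rests on the MSE term alone. With a non-zero weight, the same 
quantity that later serves as the performance index also steers the 
choice of parameters. 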



3 SUMMARY 

The fact that over-fitting at the level of model selection can have 
a very substantial deleterious effect on performance evaluation 
has been widely discussed and recognized in machine learning 
research (Cawley and Talbot, 2010) and in the bioinformatics community 
(Smialowski et al., 2010). Protein inference faces the 
same problem. The main objective of this letter is to 
highlight this fact so that people are aware of this risk in future 
comparisons when developing new protein inference algorithms. 